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Abstract 

Chemical liabilities, such as adverse effects and toxicity, play a significant role in modern drug discovery process. 
In silico assessment of chemical liabilities is an important step aimed to reduce costs and animal testing by 
complementing or replacing in vitro and in vivo experiments. Herein, we propose an approach combining several 
classification and chemography methods to be able to predict chemical liabilities and to interpret obtained results 
in the context of impact of structural changes of compounds on their pharmacological profile. To our knowledge 
for the first time, the supervised extension of Generative Topographic Mapping is proposed as an effective new 
chemography method. New approach for mapping new data using supervised Isomap without re-building models 
from the scratch has been proposed. Two approaches for estimation of model's applicability domain are used in 
our study to our knowledge for the first time in chemoinformatics. The structural alerts responsible for the negative 
characteristics of pharmacological profile of chemical compounds has been found as a result of model interpretation. 

Keywords: Cheminformatics, Chemography, Applicability domain, Generative topographic mapping, Dimensionality 
reduction, Supervised generative topographic mapping, Isomap, Supervised Isomap 



Background 

During the past decade, computational technologies and 
predictive tools have been deeply integrated in the mod- 
ern drug discovery process and changed this process 
extracting the useful knowledge embedded in the com- 
plex arrays of chemical and biological information to se- 
lect the most promising compounds as early as possible 
and to reveal chemical liabilities in order to reduce the 
risk of late stage attrition [1,2]. Chemical liabilities, such 
as adverse effects and toxicity, play a significant role in 
modern drug discovery process. Methods to avoid or re- 
duce chemical liabilities are an important target for drug 
discovery and development. Herein, we propose an ap- 
proach combining several classification and chemogra- 
phy [3] methods to assess chemical liabilities in silico 
and to interpret obtained results in the context of 
impact of structural changes of compounds on their 
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pharmacological profile. Model development has been 
performed in six different descriptor spaces for mutage- 
nicity, carcinogenicity, acute toxicity and phospholipido- 
sis data sets. A set of machine learning methods has 
been involved in model development encompassing 
well-known approaches with new ones. The combination 
of classification and data visualization is a key point for 
mechanistic model interpretation which allows one to 
understand which changes of the existing structures are 
required to improve target properties, to generate new 
hypothesis and, finally, to optimize the chemical struc- 
tures. Over the years, a number of dimensionality reduc- 
tion approaches [4-11] have been proposed and used in 
cheminformatics. The most known and widely used 
among these methods are Principal Component Analysis 
[12], Multidimensional Scaling (MDS) [13,14], Self- 
Organizing Maps (SOM) [15], Stochastic Proximity Em- 
bedding [16-18], Stochastic Neighbor Embedding [19,20], 
Sammon Mapping [21] and Generative Topographic Map- 
ping (GTM) [22-24]. In this study, Generative Topo- 
graphic Mapping and Isomap as well as their supervised 
extensions have been involved. Recently, the unsupervised 
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implementations of these approaches have been used in a 
number of studies in chemoinformatics [25-32]. These 
two representatives of nonlinear dimensionality reduction 
methods are related to two different families: distance- 
based approaches and topology based approaches. Isomap 
reduces the dimensionality of data by using distance pres- 
ervation as the criterion, that is intuitively understandable 
and easy to compute. GTM is related to the topology 
based techniques. This group of methods tries to preserve 
topology principle that is concerned to relative proxim- 
ities: compounds which are close in the data space remain 
close in the data visualization model. Topology preserva- 
tion usually is considered as more powerful and in the 
same time more complex comparing with distance preser- 
vation [4]. The comparison of used techniques on the 
considered data is performed in this study. Support vector 
machines (SVM) [33], GTM and probabilistic neural net- 
works (PNN) [34] have been used for the development of 
classification models. Two applicability domain of models' 
approaches (AD) are involved in our study in order to as- 
sesses the models limitation in prediction of new data in 
order to reliably predict those data that are structurally 
similar to the training set compounds used for model de- 
velopment. Recently, several different AD approaches have 
been proposed [35-49]. Here, we use the representatives 
of two families of AD methods: distance-based (Ball) [50] 
and probability-based (Local Outlier Factor LOF) [51]. 

Here, to our knowledge for the first time, we propose 
supervised extension of Generative Topographic Maps 
[52] that can be used as a universal tool to visualize the 
chemical space and to develop classification models. 
New approach for projecting new data using supervised 
Isomap [53] without re-building models from the scratch 
has been developed. The evaluation of the performance of 
the dimensionality reduction techniques and introduced 
descriptor spaces to separate different activity classes has 
been monitored by three parameters, two of them have 
been used in cheminformatics for the first time. 

Materials and methods 

Data preparation 

Data preparation has been carried out using recommen- 
dations published in [54]. Chemaxon Standardizer [55] 
and Instant JChem [56] software have been used for the 
data preparation. Using Standardizer, the explicit hydro- 
gen atoms have been removed, the structures have been 
aromatized and neutralized. Four data sets have been 
used in our study. 

Mutagenicity 

Ames mutagenicity data from a study by Kazius et al. 
[57]. The data set contained 2367 active and 1888 in- 
active compounds. External test set consists of 1164 ac- 
tive and 2167 inactive compounds. 



Carcinogenicity 

Data was collected from the distributed ISSCAN Data- 
base (part of structure-searchable toxicity DSSTox 
public database network [58]). The database has been 
specifically designed as an expert decision support tool 
and includes the carcinogenicity classification "calls" to 
guide the application of SAR approaches. Collected data 
set encompass 1088 chemical structures containing 648 
compounds annotated as actives and 440 as inactive 
compounds. External test set [25] contains 359 actives 
and 141 inactives. 

Phospholipidosis 

A set of 100 phospholipidosis-inducing compounds and 
82 negative drug-like compounds were taken from [59], 
where the active compounds have been observed to act 
on a range of species (humans, rats, mice, dogs, rabbits, 
hamsters and monkeys) and on a variety of tissue types 
(lungs, kidney and liver). External test set from [60] con- 
tains 141 active and 359 inactive compounds. 

Acute toxicity 

Data from EPA Fathead Minnow Acute Toxicity Data- 
base [61] after data preparation stage containing 612 
compounds (578 actives and 34 inactives). This database 
was generated by the U.S. EPA Mid-Continental Ecology 
Division (MED) for the purpose of developing an expert 
system to predict acute toxicity from chemical structures 
based on mode of action considerations. 

Descriptors 

In this study, six descriptor types have been involved in 
model development. ISIDA package [62] has been repre- 
sented by two different descriptor types: (i) ISIDA 
Property-Labeled Fragment Descriptors (IPLF) [63] 
(atom-centered fragments (augmented atoms) of radius 
1 to 3 colored by pH-dependent pharmacophores and 
(ii) subclass of ISIDA Substructural Molecular Frag- 
ments (SMF) [62] consisting of the shortest topological 
paths with explicit representation of only terminal atoms 
and bonds, where the values of minimal n min and max- 
imal n max number of atoms varied from 2 to 15. 2D de- 
scriptors of Molecular Operating Environment (MOE 
2D) [64] containing different physical properties, subdi- 
vided surface areas, atom and bond counts, Kier&Hall 
connectivity and Kappa shape indices, adjacency and 
distance matrix descriptors, pharmacophore feature de- 
scriptors and partial charge descriptors were involved in 
model development. The CDK (Chemistry Development 
Kit) MACCS keys and extended fingerprints (EF) were 
computed using the RCDK package [65] of the R soft- 
ware [66]. Finally, Dragon software [67] has been used 
for molecular descriptors calculations. Constant and 
nearly constant descriptors were removed. Detailed table 
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with the final number of descriptors for each data set and 
descriptor type is represented in supporting information. 

Methods 

Classification methods 
Support Vector Machines (SVM) 

SVM [68,69] is a supervised learning method commonly 
used for classification and regression and based on stat- 
istical learning theory of Vapnik-Chervonenkis [70,71]. 
Projecting the original data described by means of de- 
scriptor vectors to a higher dimensional feature space 
SVM achieves distinct separation between considered 
classes of compounds finding the optimal position of the 
separating hyperplane between the instances from the 
classes. 

Generative Topographic Mapping (GTM) 

GTM is a specific unsupervised density network based 
on generative modeling. It can be considered as prob- 
abilistic extension of Kohonen Self-Organizing Maps. 
Like SOM, it operates with a grid of K nodes, which can 
be considered as analogs of nodes in SOM. GTM creates 
a generative probabilistic model in the high-dimensional 
data space RD by placing a radially symmetric Gaussian 
with zero mean and inverse variance ft around projec- 
tions of the latent space centers which approximating 
the data density. The nonlinear GTM transformation 
from the latent space to the data space is defined using a 
Radial Basis Function (RBF) network. Thus, each node is 
projected to the center of Gaussian belonging to the 
manifold (two-dimensional flexible sheet located in the 
high-dimensional space in such a way to cover the data 
points by stretching or compressing) embedded in the 
data space. This manifold can be considered as a repre- 
sentation of the latent space in the data space. The coor- 
dinates of the Gaussians are computed as a linear 
combination of Gaussian basis functions and for the 
point x in the latent space its projection to the data 
space can be defined as: 



y = W(p(x) 



(1) 



where W- the output weights of RBF. 

It relates the real data in the chemical space with 
manifold points. Thus, any point of the latent space R L 
has its own projection in a data space R D obtained by 
non-linear parameterized mapping y(x, W). 

The mapping function y{x, W) is continuous, which 
leads to the topographic ordering of the projected 
points, i.e. two points that are close in the latent space 
are also close in the data space. Defining a probability 
distribution over the latent space induces the corre- 
sponding distribution over the manifold in the data 



space and, thus, imposes the probabilistic relationships 
between two spaces. 

The iterative Expectation-Maximization algorithm 
(EM-algorithm) is used to find the parameters of RBF 
network (W and ft) maximizing the, so called, log likeli- 
hood function which measures a correspondence be- 
tween the data distribution and the model. 



n=l [ K i=l ) 



(2) 



where L - log likelihood function, ft - inverse of variance, 
W - the output weights of RBF, K - number of the nodes, 
N - number of compounds, p(t n \x h W, ft) - prior probability 
generated in a point t n in the data space by the Gaussian 
with a center in y(x» W). 

Activity profile of a chemical compound can be assessed 
starting from the values of the class-conditioned probabil- 
ity distribution function p(t\C k ) computed for each class 
Cjo where t is its molecular descriptor vector. Such func- 
tion can be build, for each activity class, by training a sep- 
arate GTM model on the data belonging to class C k . The 
class-conditioned probabilities p{t\C k ) can be used for 
computing posterior probabilities of class membership 
P(C k \t) for a given compound using the Bayes theorem: 



P{C k \t) 



p{t\C k ).P{C k ) 
Pi*) 



(3) 



where P(C k ) = is a prior probability of class mem- 
bership (N k - the number of compounds belonging to 
class C k ; N tot - the total number of compounds), 
whereas p(t), the marginal probability density function, 
is the normalization factor: 



p{t) = Y J P(t\C k )-P{C k ) (4) 



The latter ensures that the estimated posterior prob- 
abilities are normalized. By applying Function 3 to each 
class C k one can assess the posterior probability of class 
membership for each compound. According to statistical 
decision theory [72], the optimal class assignment is de- 
termined by the maximal value of posterior class prob- 
abilities P(C k \t). 

Probabilistic Neural Networks (PNN) 

PNN [34] belongs to a group of feed-forward neural net- 
work algorithms. It was derived from Bayesian Networks 
[73] and Kernel Discriminant Analysis [74]. PNN con- 
sists of four layers: input layer, pattern layer, summation 
layer and output layer. 

An input layer represents the input vector, e.g. a com- 
pound from a test set. Each compound is attributed to a 
single neuron of pattern layer, for which its descriptors 
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represent a weight vector. Therefore all pattern neurons 
can be marked with the class labels of corresponding 
compounds. Input layer interconnected with a pattern 
layer, thus each pattern unit forms a dot product Z of an 
input vector and its weight vector. Z is propagated to 

Z-l 

the network activation function e * 2 and the result is out- 
putted to the summation layer. Each neuron in the sum- 
mation layer is connected to pattern units of the 
corresponding class. This layer performs simple summa- 
tion of the inputs from the pattern layer. The output 
layer is a two-input layer, which produces a binary out- 
put. It takes into account the contribution for each class 
of inputs. The output is a 1 (positive identification) for 
that class and a 0 (negative identification) for non- 
targeted classes. In fact, there's no training required 
since the compounds of the training set are considered 
as the weights to the hidden layer of the network. As no 
training required, classifying an input vector is fast, de- 
pending on the number of classes and compounds in 
use. PNNs have some advantages comparing with multi- 
layer perceptron networks: they are faster, relatively in- 
sensitive to outliers and generate probability scores. 

Dimensionality reduction methods 

Supervised Generative Topographic Mapping (s-GTM) 

GTM performs visualization by inversing mapping from 
the data space to the latent space (unbending this flex- 
ible sheet into the rectangular 2D map). For this Bayes 
theorem is used. Thus, for each molecule GTM calcu- 
lates its probability to be located in the given point of 
this map represented by the latent space and visualizes 
this molecule according this probability. 

In order to make manifolds location in the data space 
dependent on distribution not only of the whole data 
set, but also of each class, a new supervised training pro- 
cedure was performed. Each iteration consists of two 
major steps. 

On the first step latent points are ascribed to one of 
the data classes in consideration. To this end we calcu- 
late responsibilities r/ <n (i.e. posterior probabilities that 
data point t n was generated by the component x^). 

rfa =^.,W^=-#!^M^- (5) 



X^ 1 /»(f«|**,W ) j8)/>(*<) 



where p(x k ) = 

For each latent point % the following sums are 
calculated 



1 '' 

V (=1 



(6) 



where index / refers to one of the classes, A/, - is a num- 
ber of compounds in this class. 



The latent point is associated with the class with the 
largest sum of responsibilities, only in a case when the 
difference between the sums is greater than the thresh- 
old value thr, which is an external parameter of the 
method. If not, the latent point remains unlabeled. To 
assure the formation of clusters of similarly labeled la- 
tent points, the influence of neighbor latent points is 
taken into account by decreasing the threshold value if 
the latent point on previous iteration had neighbors as- 
sociated with the class which responsibility sum is larger 
on the current iteration and increasing it if the neigh- 
bors are from the opposite class. 

The second step contains movement of the latent 
points projections towards data points of the corre- 
sponding class by adjusting the RBF network. The sum 
T of vector distances from the latent point x^ projection 
to all the data points t w for which r^ n > rr, is calculated. 
Here, rr denotes the responsibility radius, another exter- 
nal parameter of s-GTM. If the data point belongs to the 
class opposite to that of the latent point, the correspond- 
ing distance is multiplied by -1 (thereby, the vector form 
the data point to the latent point is obtained). The de- 
sired new coordinates P' of the latent point projection 
are defined the following way: 



N 



(7) 



Then RBF network is trained using the coordinates of 
Xk in the latent space as input and P' as a target. 

Supervised GTM has a number of external parameters 
that have a great influence on the model development. 
Main parameter for latent points' colorization is the 
threshold value. It should be low enough, in order to 
allow a considerable amount of latent points to get la- 
beled. The maximum value can be found from analyzing 
the responsibility matrix and strongly depends on the 
number of latent points: the larger is their number, the 
lower should be the threshold value. 

The influence of the color of neighbor latent point is de- 
fined by additional compound for threshold calculation: 



thr[ l = thr + 



N 2 -N x 



(8) 



where thr is an original value, Ni and N 2 - number of 
neighboring latent points of class 1 and 2 respectively, p - 
an external parameter, thr \j - is a threshold value, specific 
for class 1 and latent point /. This means, that latent point 
/ will be labeled as class 1, if 



Si > S 2 + thr' l t 



(9) 



It is obvious, that parameter p is required to bring 
both terms of Formula 9 to similar scale. Surprisingly, in 
quite a wide range it has small impact on the model, but 
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can be very useful for imbalanced data to prevent all the 
latent point to be marked by the same class label It 
should be altered for fine optimization or in case if no 
similarly labeled clusters of latent points are formed dur- 
ing the training process. 

S-lsomap 

Isomap [75] is a low-dimensional embedding method. It 
implies that data are disposed along a manifold with a 
dimensionality d less than dimensionality d Q of the ori- 
ginal data space. Our aim is to "unroll" the manifold into 
a d-dimensional space, so that data points, which are 
close to each other on the manifold remain close, and 
remote points - stay remote. To this end, we replace 
Euclidian distance with geodesic one - the length of the 
shortest curve between two points that lies on the 
manifold. 

Isomap algorithm consists of three steps. On the first 
step we define k nearest neighbors of each compound 
and assume that Euclidian distances between them are 
small and, thus, are nearly equal to corresponding geo- 
desic distances. This assumption allows us to create a 
weighted graph where only the vertices that are nearest 
neighbors are connected and the length of each edge 
equals the corresponding distance. This graph is not al- 
ways connected and in this case the largest connected 
part is taken for the next step. After the graph has been 
constructed we compute shortest distances between its 
vertices. Then obtained distance matrix is used for 
multidimensional scaling (MDS) [13,14] from original to 
^-dimensional space. To minimize the cost function in 
MDS coordinates of compounds in the new space 
should be set to the top d eigenvectors of the matrix r 
(p) [76], where D is a matrix of pairwise distances be- 
tween training points and r is an operator, that converts 
distances to inner products. For visualization purpose 
we set d = 2. 

Supervised extension of Isomap was proposed in 
[53]. It differs from the original algorithm in its first 
step. Instead of Euclidian distance d{x u xj) between x t 
and Xj a new measurement of compounds' dissimilar- 
ity is calculated. 



Euclidian distance between all pairs of data points is 
usually used. The parameter a gives some chance to the 
points from different classes to be more close to each 
other. 

After new distances have been calculated k-nearest 
neighbors are defined and weighted graph is constructed 
in the way it is done in non-supervised algorithm. 

A way to extend nonsupervised Isomap to new points 
was proposed in [77,78]. There coordinates of new 
points are calculated as 

(11) 

where X k is eigenvalues and V/ d - coordinates of the cor- 
responding eigenvectors of the matrix t(d) , operator 
E x > denotes average over the data set. To make this work 
for S-Isomap we take into consideration Eq. 11, while 
computing D(xi,x) - geodesic distance from and exter- 
nal point x to the training point x t . We assume that the 
distance from x to its k nearest neighbors of x is small 
enough to make not much difference between two parts 
of Eq. 11, and so we can use their average as a geodesic 
distance from x to its k nearest neighbors. Other geo- 
desic distances are found from matrix D by computing 
the shortest paths as it has been done while training the 
model. If value - is too large (which happens when 
average distances between compounds in the original 
data space are much exceed one), additional coefficient 
fti can be used for both training the model and extend- 
ing it to the new points. In this case the parameter ft in 
Eq. 11 is replaced with ft^ft. 

Applicability domain approaches 
Ball 

Ball [50] is a distance-based method for outlier detection. 
It uses If -metric, in which distance between compounds 
x and y in feature is space denoted by Formula 11. 



dist LP (x 1 y) = i^^Xi-y^ 



7, 



(12) 



D{xi,Xi) = < 



d {xi •) Xj ^ 



i -g ^ ,ifyi=yj 

d (xi^Xj^ 
k Ve & -a,ify i ^ty j 



(10) 



Here y t denotes class label of compound x» and ft is a 
parameter that prevents D(x if Xj) from increasing too 
fast, ft should depend on data density and average 



The algorithm optimizes the weight vector w the fol- 
lowing way: 



mmp 

s.t.^^ Wj \xjj-aj \ p < p 
j 

Wj = i, Wj > o 



(13) 



where a is a centroid of the data points and Xy denotes 
the coordinate ; of the compound x t . 
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After w is optimized, the compounds x t for which 
y^Wj\xij-aj\ p is the largest are considered as outliers. In 

other words this method fit L p "ball" around the data. 
This "ball" separates targets from outliers. Figure la 
demonstrates the case of 2-dimensional feature space 
with Wi = w 2 - 



Local Outlier Factor (LOF) 

LOF is a probability based method for outlier detec- 
tion in a multidimensional dataset [51]. It operates 
with local densities of objects in the dataset by using 
the definition of local reachability density and calcu- 
lates value of "local outlier factor" that indicates the 
degree of objects dissimilarity to other compounds in 
the data set. 

To define the local reachability density we should first 
introduce some other concepts. We call /^-distance of 
the object p (dist k (p)) the smallest value for which there 
are at least k objects besides p with a distance from p 
smaller or equal to dist k (p). AT-distance neighborhood of 
an object p (N k (p)) is a set of objects, not including p, 
whose distance from p does not exceed dist^ip). Let us 
specify that the cardinality of Njjip), which we also de- 
note as \Nj^p)\, can be greater than h in case, when in 
Nk(p) exist two or more objects whose distances from 
p are equal to dist^ip). Reachability distance of object 
p with respect to object o (rdist k (p,o)) is the maximum 
value between /^-distance of o and distance from o to 
p. The idea of reachability distance is illustrated in 
Figure lb. 

Local reachability density can be defined as 



lrd k {p) 



oeN k (p) 



rdistkip, oy 



Wkip)\ 



(14) 



The local outlier factor is an average of the ratios of 
the local reachability densities of objects to those of ob- 
jects k nearest neighbors (Eq. 15). 



LOF k (p) 



E 



lrdk{o) 



'oeN k {p) lrd k {p) 

Wk~ip)\ 



(15) 



In [51] is shown, that LOF of objects that lie 'deep' in- 
side a cluster approximately equals to 1. It is also shown 
that in majority of cases k can be chosen so that for all 
objects that belong to some cluster of objects LOF ap- 
proximately equals to one, and for any other object it 
significantly differs from one. This fact allows us to de- 
tect compounds that do not belong to any cluster and so 
can be called outliers. 

Experimental 

The predictive performance of developed classification 
models was assessed using five-fold external cross- 
validation (5-CV) procedure considering Balanced Ac- 
curacy (BA) value [79] as a criterion of the predictive 
performance of the models. BA is an average of two 
other criteria, Sensitivity and Specificity, which were de- 
signed to assess models ability to identify compounds 
from a certain class (active or positive for Sensitivity and 
inactive or negative for Specificity) disregarding its be- 
havior for the other class. The combination of Sensitivity 
and Specificity should be able to compensate possible 
imbalance in the dataset. 



BA = - {Sens + Spec) 



1 



tp 



+ ■ 



tn 



2 \tp +fn tn+ fp 



(16) 



where Sens is Sensitivity, Spec is Specificity, tp stands for 
true positive rate (e.g. the number of correctly predicted 
active compounds), tn - for true negative (correctly pre- 
dicted inactive compounds), fp - for false positive 



a) 





fiir*(p lr o) = disttio) \ 



Figure 1 The methods of applicability domain estimation: a) The LP "ball" of radius one in 2-dimentional space for different values of 
p and w-i = w 2 ; b) the reachability distance of objects p-i and p 2 with respect to object o for k= 3. 
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(inactive compounds thatve been predicted to be active), 
fn - for false negative (active compounds thatve been 
identified as inactive ones). 

LibSVM [80] was used for developing SVM models, 
two its external parameters v and y were varied from 
0.01 to 0.91 and from 2~ n to 2 3 respectively. 

GTM models were built with the help of the Netlab 
[81] package. This implementation can't work with 
large number of descriptors, so the Principal Compo- 
nent Analysis was introduced beforehand. Here, the 
following external parameters were gone over. Number 
of first principal components, that were retained, was 
varied from 20 to 60, number of latent points - from 
5 2 to 50 2 , number of radial basis network centers - 
from 2 2 to 7 2 . 

PNN was implemented in Classification Toolbox for 
use with MATLAB [82]. Its only external parameter 
Gaussian width was chosen from the range of [0; 1]. 

Among all the developed models for each combination 
of dataset, descriptor type and applied method one with 
the highest Balanced Accuracy was selected for further 
analysis. 

For s-GTM the value of threshold was tried from 0 to 
0.2, in most cases we used p = 40 or p = 30. The only ex- 
ternal parameter in the movement step (responsibility 
radius, rr) has a great influence on the model. Too small 
values leads to small changes in the model compared to 
unsupervised GTM, too big - to mapping all com- 
pounds into a single point. This parameter was sorted 
out in a large range. 

The performance of data visualization has been moni- 
tored with three quantitative measures. Each of them is 
normalized to vary from 0 to 1 and can be computed for 
a data set where the information about the classes is 
available. 

T-score 

T-score [26] takes into account k nearest neighbors of 
each projection. The more neighbors of each point be- 
long to the same class, the higher is T-score. Thus, 
this score characterizes the ability of a model to 
produce similar-structure clustering in a visualization. 
To compute T-score one need to take the following 
steps. First, for each compound v/ G(l, k) should be 
computed: 



7=1 



(17) 



where k is the number of nearest neighbors, which is an 
external parameter, g(vi,j) = 1 if the ;th nearest neighbor 
of V/ in the visualization space belongs to the same class 



as v/, g (Vbj) = 0 otherwise. Then for each class i y t (k) is 
defined as 



(18) 



1=1 



where Hi is a number of compounds of class L And fi- 
nally the T-score is 



r(*) = £l>i(*) 



where N is a number of classes. 



(19) 



Distance Consistency (DSC) 

DSC [83] is based on the distances from points to the 
centroid of each class. It is higher when more points are 
closer to the centroid of the corresponding class, then to 
any other. The score is equal to 1, if the model projects 
compounds into separate clusters, one for each class. 
The computation of DSC is similar to the computation 
of T-score, but instead of g{vj> j) the centroid distance 
(CD) is used. Beforehand for each class i one need to 
find the coordinates of its centroid c t . Then CD(v/, q) = 1 
if the closest to v/ is the centroid c t and v/ belongs to 
class i and CD(v/, cj) = 0 otherwise. Then for each class i 



C(i)=-J2CD(v h a) 
m i=i 

1 N 



(20) 
(21) 



i=i 



Distribution Consistency (DC) 

DC [83] estimates the overlapping of classes. It divides a 
map into separate areas and treats them independently. 
For each area the value of entropy is computed, which is 
0, if all the points in the area share one class label, and 
reaches maximum, when every class is represented in 
the area by equal number of points. For DC computa- 
tion the conception of entropy of the region R is to be 
introduced. 



(22) 



Here, p t is a number of molecules of class i in the region 
R. And the value of DC is defined the following way 



(23) 



Pr is the whole number of molecules in the region R 
and a coefficient Z = n log 2 A/' is used to range DC from 0 
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to 1. In this work to obtain the required regions we di- 
vided the visualization map into 15 x 15 equal sized 
rectangles. 

Results and discussion 

Classification models performance 

As one can see in Figure 2, for three out of four consid- 
ered datasets, the best predictive performance was dem- 
onstrated by the Support Vector Machine approach 
(carcinogenicity - 68%, mutagenicity - 83%, phospholi- 
pidosis - 82%). Yet, in prediction of acute toxicity GTM 
significantly outperformed SVM (Balanced Accuracy 
reached 86% for GTM and 75% for SVM models). It is 
also seen that for GTM approach IPLF descriptors 
shown to be less effective than others, while applying 
molecular fingerprints for both SVM and GTM ap- 
proaches led to high values of Balanced Accuracy. The 
behavior of accuracy for acute toxicity predictions sig- 
nificantly differs from those for other data sets. Molecu- 
lar fingerprints here showed nearly the worst results 
among all the types of descriptors in this study (63% for 
SVM and 71% for GTM), while the best predictive 



performance was achieved using descriptors of the MOE 
and Dragon packages. The corroboration and a possible 
explanation of this fact may be given by found in the at- 
tempt of detection of structural alerts that is given 
further in this chapter. Implementation of MACCS de- 
scriptors failed to mark out any fragments that are re- 
sponsible for toxic activity of compounds. The reason 
for this may be the imbalance of this particular data set. 
The deficiency of inactive compounds leads to difficul- 
ties in determining whether the presence of a fragment 
in several inactive compounds is an accident. Though in 
most cases SVM outperforms GTM, the analysis of its 
work is obstructed by the lack of intrinsic information 
about the predictive decisions. GTM, on the other hand, 
not only gives easily interpretable probability distribu- 
tion for each compound, but also can be used as a tool 
for data visualization and outlier detection. 

PNN may be considered a compromise between the 
lack of methods internal information of SVM and the 
decrease of accuracy of GTM. It is not such a universal 
tool as GTM but slightly outperforms it (up to 6% for 
mutagenicity). At the same time, PNN makes less 
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Figure 2 Predictive performance [Balanced Accuracy) of SVM (I), GTM (II) and PNN (III) approaches for different types of descriptors and 
different datasets: acute toxicity (a), carcinogenicity (b), mutagenicity (c), phospholipidisis (d). 
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accurate predictions then SVM, but allows one to look 
through the background of each decision by analyzing 
pattern and decision layers. There is a similarity in be- 
havior of SVM and PNN. 

Dependence of Balanced Accuracy from datasets and 
descriptor types obtained by PNN is turned out to be 
similar to that of SVM, but not of GTM, though both 
PNN and GTM are neural networks. 

The considered data sets were previously studied by 
other teams. Thus, classification of the acute toxicity 
data set has been performed in [84]. The compounds 
have been divided into classes differently than in our 
study and in the original database. A set of different ma- 
chine learning approaches including several types of 
neural networks as well as SVM, Decision Trees and 
Gene Expression Programming have been applied for 
classification purposes. Corresponding Balanced Accur- 
acy values of the developed models varied in the range 
from 0.85 to 0.93. A number of studies [85,86] with the 
regression analysis have been published including the 
original publication of this data set [61]. The carcinogen- 
icity data involved in this study has been used in QSAR 
studies mostly as a source of further data retrieval. It has 
been used, for example, as a part of considered data in 
[87]. A thorough analysis of the mutagenicity data set in- 
cluding the applicability domain estimation has been 
performed in [40]. The direct comparison of the ob- 
tained results performance is straitened because of the 
difference in the statistical parameters used. Comparable 
results (obtained by combination of ECFP descriptors 
with Random Forest and Nearest Neighbor classifiers) 
have been recently reported in [88]. In [59] SVM and 
Random Forest were applied for phospholipidosis pre- 
diction. There Matthews Correlation Coefficient was 
used to assess the results performance, and its values 
varied up to 0.72 that outperforms the maximum value 
of this parameter in our study. 

Predictions of models developed on IPLF, ECFP and 
SMF descriptors were analyzed. The numbers of com- 
pounds containing a certain descriptor d, predicted to 
be active n d act and inactive n d inacV were calculated for each 
descriptor. Then the corresponding fractions of com- 
pounds were calculated as 



fr d 

J act 



d d 
act jyd inact 

M-act VI inact 



(24) 



where n act and n inact are total number of compounds, pre- 
dicted to be active or, respectively, inactive, by the model 
at issue. The rare descriptors with fr d act + frf nact < 0.05 
were excluded from further consideration. Among other 
descriptors were selected for further analysis those with 
fi^- > 2.5. Fragment descriptors used for the prediction 

/ ' inact 



of compounds as actives by all three classification methods 
are demonstrated in Figure 3. 

MACCS descriptors were not effective in detecting 
structural alerts for all data sets, but mutagenicity, where 
eight descriptors detected mostly nitro groups. There 
are limited number of descriptors, which all three 
methods considered to be structural alerts. PNN tends 
to attribute descriptors to structural alerts that may be 
one of the reasons of its inferior efficiency compared to 
SVM. The described approach didn't allow detecting 
structural alerts for phospholipidosis. Though more than 
30 descriptors were unanimously marked by the methods, 
all these descriptors refer to several groups of active com- 
pounds with similar structure (an example is demon- 
strated in Figure 3). 

Performance of data visualization models 

In this study, supervised extensions of Isomap and GTM 
were used for data visualization. 

S -Isomap was first introduced in [53]. It demonstrated 
excellent results in separation different classes of train- 
ing set. Mapping of the external test set is an important 
part of the chemography from the practical point of view 
in the context of the possibility of the application of the 
developed models to virtual screening and to mechanis- 
tic model interpretation which allows one to understand 
which changes of the existing structures are required to 
improve target properties, to generate new hypothesis 
and, finally, to optimize the chemical structures. In the 
original article for mapping an external test set it was 
recommended to use Radial Basis Network. In our study 
it turned out to be ineffective for diverse sets of chem- 
ical compounds. In this study, we propose new approach 
for the application of models to visualizing external data. 
We modified an approach proposed in [77,78] to adapt 
it for s-Isomap (See details in Methods description). The 
results of the mapping of external test sets for three 
types of activities are demonstrated in Figure 4. Herein- 
after visualization maps are presented in the coordinate 
system generated be the applied methods. GTM and 
s-GTM presume that latent space is a rectangle of 
size 2x2 with its center located at (0, 0). Isomap and 
s-Isomap project compounds into two-dimensional 
space so that Euclidean distance (for Isomap) or dis- 
similarity measure (see Eq. 10) (for s-Isomap) can be 
preserved and the map scale is chosen accordingly. 
All axes in Figures 4, 5 and 6 are relative and have 
no units of measurement. 

One can see that while s-Isomap performed almost 
perfect separation of the training set (none of the ap- 
plied assessment parameters decreased below 0.91), the 
quality of mapping an external set for these models is 
highly dependent on the dataset in consideration. An 
external set of mutagenicity was mapped quite accurate 
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Figure 3 Examples of the descriptors frequently used to predict compounds as active by all three applied methods. For each descriptor 
an example of inactive compound is given (if any). 
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Figure 4 S-lsomap data visualization for considered datasets (the maps for the best combination of involved approach and descriptor 
type are given). Each point in the map corresponds to the individual compound (in red, blue - actives, black, green - inactives). In the left 
column the values of visualization quality assessment parameters are presented for the training set, in the right one - for the test set. 
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Figure 5 The visualization maps of acute toxicity dataset (MOE descriptors), obtained by unsupervised (left) and supervised (right) 
GTM. The singled out molecules share similar structure but belong to different classes. They are mapped close to each other by GTM, but are 
distinguishable in the s-GTM visualization. 



(r-score = 0.80, DSC = 0.86, DC = 1.00), while the map- 
ping of external set for carcinogenicity is moderate: the 
corresponding parameters varied in the range 0.54-0.62. 
One of the main factors that determine the quality of 
mapping is the distance from each point of the external 
set to the nearest neighbors in the training set. The 
closer they are, the better are the results. In case, if the 
distances are much greater than one, but are of the same 
scale, the additional parameter fii can be used to put 
them to the desirable range (look [53] for the specific 
values). In the case of carcinogenicity, in particular, the 
distances from the points of the external set differ for 
several degrees. 

The supervised extension of GTM is proposed in this 
paper for the first time. It demonstrates a significant im- 
provement in visualization performance. An example for 
acute toxicity dataset and MOE descriptors is given in 
Figure 5. Besides a noticeable increase in all three used 
visualization quality measures (r-score raised from 0.62 
for unsupervised model to 0.77 for the supervised one, 
DSC - from 0.57 to 0.87 and DC - from 0.85 to 0.95, 
respectively), one can see how structurally similar com- 
pounds related to different classes and close to each 
other on the map obtained by unsupervised GTM are 
separated using supervised extension of GTM. Here, two 
groups were selected, each of them contained structur- 
ally similar active and inactive compounds. The first one 
contains toxic 1-Decanol and non-toxic 1-Tridecanol 
that differ from each other only by the length of the car- 
bon chain (Tanimoto Similarity Coefficient (TSC) is 
equal to 1.00). The second group consists of toxic 2- 
Undecanone and 2-Dodecanone and similar to them 
(TSC = 0.82) non-toxic 3-Tetradecanal. All these com- 
pounds were mapped into a small area by unsupervised 



GTM while well distinguished applying its supervised 
extension. 

Mapping of external test set for s-GTM is performed 
using the same procedure as for GTM, and the corre- 
sponding results are demonstrated in Figure 6. One can 
see that presented visualization maps are inferior to 
those of s-Isomap. At the same time s-GTM performs 
more accurate mapping of the external test set than s- 
Isomap, since after the model has been trained, the 
training set is mapped using the same algorithm as is 
used for the mapping of an external test set. In s-GTM, 
if one includes a compound from the training set in the 
test set, it will projected exactly to the same point of the 
map. This is not so for s-Isomap. Without label informa- 
tion each mapping will be an approximation and can be 
performed in different ways. The one we've proposed is 
based on the assumption that label information does not 
have much influence on the relative location of the 
points that are close to each other. During the training 
process s-Isomap changes distances between compounds 
in different manners regarding if the compounds belong 
to the same class or not but proportionally their relative 
position. Thus, new distances for compounds from dif- 
ferent classes do not change significantly if they are close 
to each other. And if the compound from the test set 
has close neighbors in the training set, they will mapped 
close even if they belong to different classes. In Figure 6, 
as well as in Figure 4, acute toxicity maps are not pre- 
sented since we had no corresponding external set at 
our disposal. Nevertheless, s-GTM demonstrated rea- 
sonably high results visualizing this data set. Considered 
quantitative measures for the best maps varied in the 
following ranges as a function of the descriptors type: I- 
Score - 0.76-0.77; DSC - 0.72-0.87; DC - 0.93-0.96. 
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Figure 6 S-GTM data visualization for considered datasets (the maps for the best combination of involved approach and descriptor 
type are given). Each point in the map corresponds to the individual compound (in red, blue - actives, black, green - inactives). In the left 
column the values of visualization quality assessment parameters are presented for the training set, in the right one - for the test set. 
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The given examples allows one to assume that s-GTM 
tends to form clusters of identically labeled projections 
that is reflected by the increase of the DSC value as 
compared with the results of original GTM. For in- 
stance, for presented in Figure 6 examples the improve- 
ment in DSC is 0.12 for carcinogenicity, 0.18 for 
mutagenicity and 0.21 for phospholipidosis. At the same 
time, while generally s-GTM provides at least slight in- 
crease in all the considered parameters for visualization 
quality assessment, it doesn't separates areas of overlap- 
ping as successfully as s-Isomap does. The reason for 
this is that s-GTM works with the given relative location 
of compounds in the data space, while s-Isomap changes 
the distance between the compounds according to the 
label information (and thus performs some sort of 
metric learning [89]). E.g. if the choice of descriptors 
leads to overlapping differently labeled compounds in 
the original data space, s-GTM may not be able to sep- 
arate them completely, but will project an external set 
following the pattern of the training set, while s-Isomap 



can achieve almost perfect separation for the most diffi- 
cult visualization tasks, but then one may face some 
problems with the mapping of the external set. 

For each presented map (Figures 4, 5 and 6) the values 
of three quantitative measures of visualization perfor- 
mances are given. None of the parameters is perfect and 
can be individually applied for identification of adequate 
data visualization models and comparison of different 
maps. T-score, for example, is high for the maps with 
randomly mixed compounds that are still grouped in 
small clusters. Distance consistency can be low for well 
separated classes that form non-convex figures. Distribu- 
tion Consistency is usually high for imbalanced dataset 
visualization and strongly depends on its external par- 
ameter. The effectiveness of each parameter is defined 
by the nature of obtained map. For example, the maps 
may have similar DC value, but differ in DCS, which can 
be interpreted that considered maps have similar class 
overlapping and different level of clusterization. In this 
study, the combination of DC and DSC parameters 
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Figure 7 Balanced accuracy for SVM models as a function of data fraction. The compounds were removed according to the rate of their 
dissimilarity from the rest of the training set, which was assessed by the applicability domain methods. I - acute toxicity, II - carcinogenicity, III - 
phospholipidosis, IV - mutagenicity. 
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demonstrates its performance. Another advantage of DC 
and DSC is its less time- and memory-consuming com- 
pared to T-score. 

Applicability domain of models 

Two methods of applicability domain estimation were 
applied in this study, their performance was compared. 
One of them is a distance-based Ball, the other - a 
distribution-based LOF. The Principal Component Ana- 
lysis was used as a pre-processing step. Each method 
was used to generate a sorted list of compounds accord- 
ing to their "outlierness" (the value of LOF function for 
LOF and distance to the centroid for ball). The impact 
of outliers' exclusion on the Balanced Accuracy of the 
models was analyzed. 

In Figure 7 the Balanced Accuracy is given as a func- 
tion of data fraction after the exclusion of outliers. The 
nature of the changes is affected by the distribution of 
compounds between classes in the dataset and predict- 
ing performance. In all the presented cases one can see a 
certain growth in performance which is different for 
considered datasets. Thus, the Balanced Accuracy of 
models for predicting carcinogenicity has increased only 
by 2.5% and after almost half of the compounds has 
been removed, while application of LOF to the SVM 
model for phospholipidosis that used SMF descriptors 
yield almost linear growth of Balanced Accuracy from 
79% to 88% (after excluding 0.4 of compounds). 



For acute toxicity LOF proved to be more efficient 
than Ball This can be explained by the presence of sev- 
eral clusters with high density of compounds in the data- 
set containing compounds of different classes. The 
compounds in these clusters may have been correctly 
classified, while a number of false predictions were made 
for the compounds lying in the areas of classes overlap- 
ping in the midst of the clusters. In this case LOF was 
able to detect these mispredicted compounds as outliers 
and Ball just excluded the most distant from the cen- 
troid compounds in spite of the density distribution. 

For phospholipidosis Ball and LOF demonstrated 
similar performance, though LOF is a bit more efficient. 
It may indicate that the data are slightly clusterized with 
an area of clusters' overlapping and most incorrectly 
predicted compounds are located far from the main ag- 
gregation of the chemical structures. 

For carcinogenicity both applied methods demon- 
strated only a small increase of the Balanced Accuracy, 
with a better performance of Ball (in Figure 7 blue lines 
lie above corresponding black lines). This could happen 
if the projection of the dataset into the data space was a 
one cluster with irregular density distribution and large 
area of classes overlapping. 

Similar pattern can be found for mutagenicity. Here, 
the maximum increase in BA is only about 2% and Ball 
only slightly outperforms LOF for IPLF descriptors. In re- 
spect with reasonable performances of both visualization 
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and classification methods for mutagenicity dataset, one 
may assume that this dataset doesn't contain many out- 
liers and applying applicability domain analysis does not 
affect the predictive performance of models. 

To demonstrate the principles of the outlier detection, 
the example of the compounds marked by LOF as the 
most dissimilar to the rest of the phospholipidosis data 
set was provided in Figure 8. The diagram on the upper 
left corner illustrates the effect of their exclusion on the 
Balanced Accuracy of the model obtained by GTM. 

The SMF descriptors we used represent only terminal 
groups (See the section devoted to the descriptor types). 
The presented compounds were considered as outliers not 
because of the presence of some unique fragment, but be- 
cause of unique or rare combination of atoms and bonds 
and their relative location. For example, Ceftazidime is 
the only compound in the dataset that contains sulfur 
with aromatic bond together with distanced heteroatoms 
(from 9 to 15 atoms in a fragment). And only in Rifampin 
there are carbon atoms with double bonds having from 4 
to 10 atoms between them. Not all the given compounds 
are characterized by a number of unique descriptors, but 
all of them contain plenty of rare ones, as, for example, 
Colchicine. 

Conclusions 

This work concerns an approach that combines several 
classification and chemography methods for in silico as- 
sessment of chemical liabilities and for the interpretation 
of obtained results in the context of impact of structural 
changes of compounds on their pharmacological profile. 
Support Vector Machines, Generative Topographic Map- 
ping and Probabilistic Neural Network were used for clas- 
sification. The classification performances were improved 
by combination with two applicability domain assessment 
approaches (Ball and Local Outlier Factor), and their con- 
tribution was analyzed. Here, the supervised extension of 
Generative Topographic Mapping was proposed as new 
efficient chemography method. New approach for map- 
ping new data using supervised Isomap without re- 
building models from the scratch has been proposed. The 
evaluation of the performance of the dimensionality re- 
duction techniques and introduced descriptor spaces to 
separate different activity classes has been monitored by 
three parameters (r-score, Distance Consistency and Dis- 
tribution Consistency) and their efficiency was compared. 
The obtained results, which are comparable with or ex- 
ceed those, published by other teams for the given bio- 
logical activities, allow one to use proposed approach as 
an efficient filter for exclusion of compounds with un- 
desirable activities on early stages of drug design process. 
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